
    Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers

    This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second-level machine learning module that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7% error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample. Comment: 4 pages
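
    The core of Combi-bootstrap is a stacking step: the tags assigned by the existing taggers become the feature vector of a second-level classifier that learns the mapping to the new tagset. Below is a minimal sketch of that idea in Python with scikit-learn; the toy taggers, tokens, and tag names are illustrative stand-ins, not the paper's actual resources.

        from sklearn.preprocessing import OrdinalEncoder
        from sklearn.tree import DecisionTreeClassifier

        # Toy stand-ins for existing heterogeneous taggers, each with its own tagset.
        tagger_a = lambda tok: "N" if tok.istitle() else "V"
        tagger_b = lambda tok: "NOUN" if tok.endswith("s") else "VERB"
        taggers = [tagger_a, tagger_b]

        def stack(tokens):
            # One row per token: the tag every existing tagger assigns to it.
            return [[tag(tok) for tag in taggers] for tok in tokens]

        # Very small sample annotated with the *new* tagset.
        sample, new_tags = ["Dogs", "run", "Cats", "sleep"], ["NN", "VB", "NN", "VB"]

        enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
        clf = DecisionTreeClassifier().fit(enc.fit_transform(stack(sample)), new_tags)
        print(clf.predict(enc.transform(stack(["Birds", "fly"]))))  # new-tagset labels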

    Memory-Based Learning: Using Similarity for Smoothing

    This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific and most general conditioning information, without the need for a large number of parameters. We report two applications of this approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art performance in both domains and allows the easy integration of diverse information sources, such as rich lexical representations. Comment: 8 pages, uses aclap.sty. To appear in Proc. ACL/EACL 97
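
    The implicit back-off hierarchy comes from weighting each feature by how informative it is: a mismatch on a high-information-gain feature pushes an instance further away than a mismatch on a weak one. A minimal sketch, assuming symbolic feature vectors and 1-NN (the paper's setup is richer); the PP-attachment triples below are toy data.

        import math
        from collections import Counter

        def entropy(ys):
            n = len(ys)
            return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

        def information_gain(X, y, f):
            # How much knowing feature f's value reduces class uncertainty.
            by_val = {}
            for x, label in zip(X, y):
                by_val.setdefault(x[f], []).append(label)
            return entropy(y) - sum(len(ys) / len(y) * entropy(ys) for ys in by_val.values())

        def nn_label(query, X, y, weights):
            # Weighted overlap distance: mismatching an informative feature costs
            # more, which mimics backing off from specific to general contexts.
            dist = lambda x: sum(w for w, a, b in zip(weights, x, query) if a != b)
            return y[min(range(len(X)), key=lambda i: dist(X[i]))]

        X = [("eat", "pizza", "with"), ("eat", "pizza", "of"), ("see", "man", "with")]
        y = ["V-attach", "N-attach", "V-attach"]
        weights = [information_gain(X, y, f) for f in range(3)]
        print(nn_label(("eat", "man", "with"), X, y, weights))  # -> "V-attach"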

    e^+e^- Annihilations into Quasi-two-body Final States at 10.58 GeV

    We report the first observation of $e^+e^-$ annihilations into hadronic states of positive $C$-parity, $\rho^0\rho^0$ and $\phi\rho^0$. The angular distributions support two-virtual-photon annihilation production. We also report the observation of $e^+e^- \to \phi\eta$ and a preliminary result on $e^+e^- \to \rho^+\rho^-$. Comment: Invited talk, 7 pages, 4 postscript figures, contributed to the Workshop on Exclusive Reactions at High Momentum Transfer, 21-24 May 2007, JLab

    MBT: A Memory-Based Part of Speech Tagger-Generator

    We introduce a memory-based approach to part of speech tagging. Memory-based learning is a form of supervised learning based on similarity-based reasoning. The part of speech tag of a word in a particular context is extrapolated from the most similar cases held in memory. Supervised learning approaches are useful when a tagged corpus is available as an example of the desired output of the tagger. Based on such a corpus, the tagger-generator automatically builds a tagger which is able to tag new text the same way, considerably diminishing the development time for the construction of a tagger. Memory-based tagging shares this advantage with other statistical or machine learning approaches. Additional advantages specific to a memory-based approach include (i) the relatively small tagged corpus size sufficient for training, (ii) incremental learning, (iii) explanation capabilities, (iv) flexible integration of information in case representations, (v) its non-parametric nature, (vi) reasonably good results on unknown words without morphological analysis, and (vii) fast learning and tagging. In this paper we show that a large-scale application of the memory-based approach is feasible: we obtain a tagging accuracy that is on a par with that of known statistical approaches, and with attractive space and time complexity properties when using IGTree, a tree-based formalism for indexing and searching huge case bases. The use of IGTree has the additional advantage that the optimal context size for disambiguation is computed dynamically. Comment: 14 pages, 2 Postscript figures
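
    IGTree compresses the case base into a decision tree: features are inspected in a fixed order of decreasing information gain, and every node stores the majority class of the cases under it as a back-off default. A compact sketch of that idea, assuming instances are tuples of symbolic context features; the real implementation is considerably more elaborate.

        from collections import Counter

        def build_igtree(X, y, feature_order):
            # Each node keeps the majority class as a back-off default.
            node = {"default": Counter(y).most_common(1)[0][0], "children": {}}
            if not feature_order or len(set(y)) == 1:
                return node
            node["feature"], rest = feature_order[0], feature_order[1:]
            groups = {}
            for x, label in zip(X, y):
                groups.setdefault(x[node["feature"]], ([], []))
                groups[x[node["feature"]]][0].append(x)
                groups[x[node["feature"]]][1].append(label)
            node["children"] = {v: build_igtree(xs, ys, rest) for v, (xs, ys) in groups.items()}
            return node

        def classify(tree, x):
            # Walk down while the context matches; otherwise back off to the stored
            # default, so the context size actually used is decided dynamically.
            while tree["children"] and x[tree["feature"]] in tree["children"]:
                tree = tree["children"][x[tree["feature"]]]
            return tree["default"]

        # Features pre-ordered by information gain, e.g. (word, prev_tag, next_word).
        tree = build_igtree(X=[("saw", "DT", "the"), ("saw", "PRP", "it")],
                            y=["NN", "VBD"], feature_order=[0, 1, 2])
        print(classify(tree, ("saw", "PRP", "him")))  # -> "VBD"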

    Forgetting Exceptions is Harmful in Language Learning

    We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms. Comment: 31 pages, 7 figures, 10 tables. Uses 11pt, fullname, a4wide tex styles. Pre-print version of article to appear in Machine Learning 11:1-3, Special Issue on Natural Language Learning. Figures on page 22 slightly compressed to avoid page overload
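
    One of the editing criteria, class prediction strength, measures how reliably an instance predicts the class of the instances for which it is the nearest neighbor; editing then removes the low-scoring (exceptional) instances. A rough illustrative sketch of that criterion follows; it is not necessarily the paper's exact formulation.

        def class_prediction_strength(X, y, dist):
            # CPS of instance i: of the times i is some other instance's nearest
            # neighbor, the fraction where their classes agree. Low CPS marks the
            # exceptional instances that editing would remove.
            n = len(X)
            used, correct = [0] * n, [0] * n
            for j in range(n):
                i = min((m for m in range(n) if m != j), key=lambda m: dist(X[m], X[j]))
                used[i] += 1
                correct[i] += y[i] == y[j]
            return [c / u if u else 1.0 for c, u in zip(correct, used)]

        overlap = lambda a, b: sum(p != q for p, q in zip(a, b))
        X = [("a", "b"), ("a", "c"), ("d", "b"), ("a", "b")]
        y = ["pos", "pos", "neg", "neg"]  # the last instance is an "exception"
        cps = class_prediction_strength(X, y, overlap)
        keep = [i for i, s in enumerate(cps) if s >= 0.5]  # the editing step the paper warns against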

    Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus

    This paper describes the lemmatisation and tagging guidelines developed for the "Spoken Dutch Corpus", and lays out the philosophy behind the high-granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator.
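
    As a rough illustration of this bootstrap setup, a supervised HMM tagger generator can be trained on a small annotated sample and then used to pre-tag the remaining material for manual correction. The sketch below uses NLTK's HMM trainer with toy Dutch-like sentences and made-up tags, not the corpus's actual tagset.

        from nltk.tag import hmm

        # Tiny annotated sample standing in for the initial corpus samples.
        train = [[("dit", "VNW"), ("is", "WW"), ("goed", "ADJ")],
                 [("dat", "VNW"), ("was", "WW"), ("slecht", "ADJ")]]

        tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)
        print(tagger.tag(["dit", "was", "goed"]))  # pre-tag new material for correction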

    Theoretical and Practical Design Approach of Wireless Power Systems

    The paper introduces the main issues in the conceptual design of wireless power systems. It analyses the electromagnetic design of the inductive magnetic coupler and proposes key formulas for optimizing its electrical parameters for a particular load. To this end, a detailed mathematical procedure is given for determining the key factors that influence proper design of the coupling coils. It also suggests basic power-electronics topologies for the conceptual design and discusses their proper connection to the grid. The proposed design strategy is verified by experimental laboratory measurements, including analysis of the leakage magnetic field.
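
    The paper's own formulas are not reproduced in this abstract; as a reminder of the quantities involved, the sketch below computes the coupling coefficient and series-compensation capacitances from the standard resonance relations, with purely illustrative component values.

        import math

        # Series-series compensation: pick capacitors so both coils resonate at
        # the operating frequency f0. All component values are illustrative.
        f0 = 85e3                    # operating frequency [Hz]
        L1, L2 = 120e-6, 120e-6      # coil self-inductances [H]
        M = 30e-6                    # mutual inductance [H]

        k = M / math.sqrt(L1 * L2)   # coupling coefficient of the coupler
        w0 = 2 * math.pi * f0
        C1, C2 = 1 / (w0**2 * L1), 1 / (w0**2 * L2)   # resonant capacitances [F]
        print(f"k = {k:.3f}, C1 = {C1 * 1e9:.1f} nF, C2 = {C2 * 1e9:.1f} nF")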

    InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

    Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tp
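
    The pipeline has two moving parts: an open-source LLM that writes a query for each document from a few-shot prompt, and a strong reranker that scores the resulting pairs so that only the best ones are kept for training the retriever. A skeletal sketch of those two steps, where llm and reranker are hypothetical callables standing in for the actual models in the repository:

        def generate_pairs(llm, docs, few_shot_prompt):
            # Few-shot prompting: the LLM writes a plausible query per document.
            return [(llm(few_shot_prompt + "\nDocument: " + d + "\nQuery:"), d) for d in docs]

        def filter_pairs(pairs, reranker, keep=10_000):
            # Filtering step: keep the pairs the reranker scores as most relevant;
            # these become positive examples for training the retriever.
            return sorted(pairs, key=lambda qd: reranker(*qd), reverse=True)[:keep]

        # training_set = filter_pairs(generate_pairs(llm, corpus, PROMPT), reranker)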

    An Empirical Re-Examination of Weighted Voting for k-NN

    For some applications of k-nearest neighbor classifiers, the best results are obtained at a relatively large value of k. With the majority voting method, these results can be suboptimal. In this paper the performance of various weighted voting methods is tested on a number of machine learning datasets. The results show that weighted voting is often superior to majority voting, and that the linear weighting function proposed by Dudani [5] often yields slightly better results than the inverse distance function that has commonly been used in more recent work. 1 Introduction Classification algorithms from the family of k-nearest neighbors (k-NN) [6] [4] or Instance Based Learning [1] [13] [12] are based on the idea that similar instances of a problem tend to have similar solutions. The basic algorithm stores a set of classified cases, represented by feature-value vectors, in memory. When a new case -- the query -- is to be classified, the k vectors with the smallest distance to it are selected...
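
    Dudani's linear weighting gives the nearest of the k neighbors weight 1 and the k-th neighbor weight 0, interpolating linearly in between, in contrast to inverse-distance weighting (w = 1/d). A small sketch of both voting rules:

        from collections import defaultdict

        def weighted_vote(neighbors, scheme="dudani"):
            # neighbors: (distance, label) pairs of the k nearest, sorted by distance.
            d1, dk = neighbors[0][0], neighbors[-1][0]
            votes = defaultdict(float)
            for d, label in neighbors:
                if scheme == "dudani":   # linear: nearest -> 1, k-th -> 0
                    w = 1.0 if dk == d1 else (dk - d) / (dk - d1)
                else:                    # common alternative: inverse distance
                    w = 1.0 / d if d > 0 else float("inf")
                votes[label] += w
            return max(votes, key=votes.get)

        print(weighted_vote([(0.1, "a"), (0.4, "b"), (0.5, "b"), (0.9, "a")]))  # -> "b"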

    Feature-Rich Memory-Based Classification for Shallow NLP and Information Extraction

    Memory-Based Learning (MBL) is based on the storage of all available training data, and similarity-based reasoning for handling new cases. By interpreting tasks such as POS tagging and shallow parsing as classification tasks, the advantages of MBL (implicit smoothing of sparse data, automatic integration and relevance weighting of information sources, handling of exceptional data) contribute to state-of-the-art accuracy. However, Hidden Markov Models (HMM) typically achieve higher accuracy than MBL (and other Machine Learning approaches) for tasks such as POS tagging and chunking. In this paper, we investigate how the advantages of MBL, such as its potential to integrate various sources of information, come into play when we compare our approach to HMMs on two Information Extraction (IE) datasets: the well-known Seminar Announcement data set and a new German Curriculum Vitae data set. 1 Memory-Based Language Processing Memory-Based Learning (MBL) is a supervised classification-based learning method. A vector of feature values (an instance) is associated with a class by...
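
    In this family of approaches, IE is cast as token-by-token classification: each token becomes an instance consisting of the token itself, its local context, and extra features such as orthography, with the IE label as the class. A minimal sketch of that instance construction; the tag names and features here are illustrative, not the datasets' actual ones.

        def window_instances(tokens, tags, width=2, pad="_"):
            # One instance per token: left/right context plus a simple
            # orthographic feature; the class is the token's IE tag.
            padded = [pad] * width + tokens + [pad] * width
            for i, tag in enumerate(tags):
                context = padded[i:i + 2 * width + 1]
                shape = "cap" if tokens[i][:1].isupper() else "low"
                yield context + [shape], tag

        tokens = ["Talk", "at", "3", "pm", "in", "Wean", "5409"]
        tags = ["O", "O", "B-stime", "I-stime", "O", "B-loc", "I-loc"]
        for features, cls in window_instances(tokens, tags):
            print(features, "->", cls)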